Deep Reinforcement and InfoMax Learning
We posit that a reinforcement learning (RL) agent will perform better when it uses representations that are better at predicting the future, particularly in terms of few-shot learning and domain adaptation. To test that hypothesis, we introduce an objective based on Deep InfoMax (DIM) which trains the agent to predict the future by maximizing the mutual information between its internal representation of successive timesteps. We provide an intuitive analysis of the convergence properties of our approach from the perspective of Markov chain mixing times, and argue that convergence of the lower bound on mutual information is related to the inverse absolute spectral gap of the transition model. We test our approach in several synthetic settings, where it successfully learns representations that are predictive of the future. Finally, we augment C51, a strong distributional RL agent, with our temporal DIM objective and demonstrate on a continual learning task (inspired by Ms. PacMan) and on the recently introduced Procgen environment that our approach improves performance, which supports our core hypothesis.
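To make the temporal objective concrete, the sketch below shows an InfoNCE-style lower bound on the mutual information between representations of states at time t and t+k, assuming PyTorch. The class name, the bilinear critic, and the initialization scale are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a temporal InfoNCE objective (assumed PyTorch design;
# names and architecture are illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalInfoNCE(nn.Module):
    """Scores pairs (z_t, z_{t+k}) with a bilinear critic and applies the
    InfoNCE loss: each positive pair is contrasted against the other pairs
    in the batch, which serve as negatives."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Bilinear critic f(z_t, z_{t+k}) = z_t^T W z_{t+k}
        self.W = nn.Parameter(torch.randn(embed_dim, embed_dim) * 0.01)

    def forward(self, z_t: torch.Tensor, z_tk: torch.Tensor) -> torch.Tensor:
        # z_t, z_tk: (batch, embed_dim) encodings of states at times t and
        # t+k drawn from the same trajectories, aligned by batch index.
        logits = z_t @ self.W @ z_tk.t()  # (batch, batch) pairwise scores
        labels = torch.arange(z_t.size(0), device=z_t.device)
        # Diagonal entries are the positive pairs; off-diagonals are negatives.
        return F.cross_entropy(logits, labels)
```

Minimizing this cross-entropy maximizes a lower bound on the mutual information between the two representations, which is the sense in which the agent is trained to "predict the future."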
Review for NeurIPS paper: Deep Reinforcement and InfoMax Learning
Strengths: The deep information maximization objective combined with noise contrastive estimation (InfoNCE) is a fairly new unsupervised learning loss that has yet to be thoroughly explored in deep reinforcement learning. The main value of the paper is the study of the representations learned when optimizing the InfoNCE loss and how those representations can be used for continual learning. Moreover, the paper introduces a novel architecture that incorporates action information into the InfoNCE loss. Both ideas are novel and, to my knowledge, have not been presented in the literature before. In terms of significance, there has been growing interest in the representations learned with the InfoNCE loss in the context of reinforcement learning; see Oord, Li, and Vinyals (2018) and Anand et al. (2019).
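As an illustration of the action-conditioned idea this review highlights, a hypothetical InfoNCE critic that folds in action information might look like the following sketch (again PyTorch; every name and the specific architecture are assumptions for exposition, not the authors' actual design).

```python
# Hypothetical action-conditioned InfoNCE critic: predict the next
# representation from (z_t, a_t), then score it against candidates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionConditionedCritic(nn.Module):
    def __init__(self, embed_dim: int, num_actions: int):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, embed_dim)
        # Map (z_t, a_t) to a prediction of the next representation.
        self.predict = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, z_t, a_t, z_next):
        # z_t, z_next: (batch, embed_dim); a_t: (batch,) discrete actions.
        q = self.predict(torch.cat([z_t, self.action_emb(a_t)], dim=-1))
        logits = q @ z_next.t()  # (batch, batch) scores
        labels = torch.arange(z_t.size(0), device=z_t.device)
        return F.cross_entropy(logits, labels)
```

Conditioning the critic on a_t lets the score reflect the transition actually taken, rather than only the marginal distribution over next states.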
Review for NeurIPS paper: Deep Reinforcement and InfoMax Learning
This paper proposes a method that applies noise contrastive estimation to future-state prediction as an auxiliary task for RL agents. The authors clearly explain their formulation and, through toy experiments, show that it works as intended. There are empirical improvements in simple continual learning settings and also in Procgen. The author response contains very useful ablation studies and connections to prior work, which I hope the authors add to the final draft, along with their acknowledged plan to move the theory sections to make the exposition clearer.